Abstract

Multimodal dialog systems research at the University of
Illinois seeks to develop algorithms and systems capable of
robustly extracting and adaptively combining information
about the speech and gestures of a naive user in a noisy
environment. This paper reviews our recent work in seven
fields related to multimodal semantic understanding of
speech: audiovisual speech recognition, multimodal user
state recognition, gesture recognition, face tracking,
binaural hearing, noise-robust and high-performance
acoustic feature design, and recognition of prosody.

1 Introduction 

The purpose of this paper is to summarize ongoing
multimodal speech and dialog recognition research at the
University of Illinois. A multimodal speech recognition
system can be described in two distinct stages: (1) robust
audiovisual feature extraction, and (2) speech and user
state recognition using dynamic Bayesian networks. Features
are extracted from audiovisual input in order to optimally
represent phonetic, visemic, gestural, and prosodic
information. Our specific ongoing research projects include
binaural hearing (array processing on a mobile platform),
biomimetic noise-robust acoustic feature extraction,
maximum mutual information acoustic feature design, and
face tracking. Customized dynamic Bayesian networks have
been designed for three different recognition tasks:
audiovisual speech recognition using coupled HMMs, user
state recognition using hierarchical HMMs, and recognition
of speaking rate using hidden-mode explicit-duration
acoustic HMMs.

Image and speech processing research at the University of
Illinois is currently tested in two ongoing research
prototype environments. The first is an experimental
computing facility for teaching children about physics. The
second is an autonomous robot, Illy, who acquires language
through the semantic association of audio, visual, and
haptic sensory data. Prior to implementation on one or both
of these platforms, most of our algorithms are tested using
standard or locally acquired datasets.

2 Pre-Processing 

2.1  Binaural  Hearing 

Our research on binaural hearing addresses the extraction
of noise-robust audio from a two-microphone array mounted
on a physically mobile platform (a language-learning
autonomous robot). The source localization algorithm is
based on a two-channel Griffiths-Jim beamformer [3] and a
new phase unwrapping algorithm for accurate estimation of
time difference of arrival (TDOA) measures [8]. The new
phase unwrapping algorithm is trained using many
measurements of TDOAs in order to create an accurate
spatial map of TDOA pattern as a function of arrival
azimuth and elevation. These maps can then be used both to
cancel interfering noise and to get a faithful
representation of the desired speech signal. Preliminary
results show that a speech signal can be accurately located
in a noisy laboratory room within a few milliseconds and
with ten-degree accuracy at a distance of 2-4 meters
(acoustic far field).
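
To make the TDOA idea concrete, the following is a minimal sketch that estimates a time difference of arrival by brute-force cross-correlation and converts it to an azimuth under a far-field, two-microphone geometry. It is not the trained phase unwrapping method of [8]; plain cross-correlation is swapped in for illustration, and the function names, the speed-of-sound constant, and the microphone spacing are all assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def estimate_tdoa(left, right, sample_rate, max_lag):
    """Delay (seconds) of `right` relative to `left` that maximizes
    the cross-correlation, searched over lags in [-max_lag, max_lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(left)):
            m = n + lag
            if 0 <= m < len(right):
                score += left[n] * right[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


def azimuth_from_tdoa(tdoa, mic_spacing):
    """Far-field azimuth in degrees from broadside (hypothetical geometry:
    two microphones separated by `mic_spacing` meters)."""
    ratio = SPEED_OF_SOUND * tdoa / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding past +/-1
    return math.degrees(math.asin(ratio))


# Toy signals: the same click arrives 3 samples later at the right channel.
left = [0.0] * 32
right = [0.0] * 32
left[10] = 1.0
right[13] = 1.0
tdoa = estimate_tdoa(left, right, sample_rate=16000, max_lag=8)
azimuth = azimuth_from_tdoa(tdoa, mic_spacing=0.2)
```

A single elevation-independent arcsine is only a rough stand-in for the measured spatial map of TDOA versus azimuth and elevation described above.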

In the current implementation, detection of a speech signal
triggers physical rotation of the receiver platform (the
robot’s “head”) so that it faces the primary talker. By
physically aligning the “head” of the robot with the
direction of primary source arrival, we are able to use
extremely efficient off-axis cancellation algorithms for
improved SNR [9].

2.2  Acoustic  Features 

Figure 1: WER: isolated digit recognition in white noise
with two standard feature sets, MFCC and LPCC, and two
novel feature sets, LPCC with voice index and with frame
index (from [6]).

Standard speech recognition features (including MFCC, PLP,
and LPCC) result in isolated digit recognition error rates
of approximately 60% at 10 dB SNR, and nearly 80% at 0 dB
SNR. In 1992, Meddis and Hewitt proposed a biomimetic
method for recognition of voiced speech in high-noise
environments [10]. They proposed filtering a noisy speech
signal into many bands, computing the autocorrelation
function $R_k(\tau)$ in each sub-band $k$, and then
estimating the speech autocorrelation $R(\tau)$ by
optimally selecting and adding together the high-SNR
sub-band autocorrelations. In our work [6], we have
replaced Meddis and Hewitt’s optimal selection algorithm by
an optimal scaling algorithm. Specifically, we estimate the
sub-band SNR using a standard pitch prediction coefficient,
i.e.

$$\frac{\text{Speech Energy in Band } k}{\text{Total Energy in Band } k} \approx \frac{R_k(T_0)}{R_k(0)} \qquad (1)$$

where $T_0$ is the globally optimum pitch period. The
maximum likelihood estimate of the noise-free speech
signal autocorrelation is then

$$\hat{R}(\tau) = \sum_k \ldots \qquad (2)$$

In isolated digit recognition experiments, the use of
equations 1 and 2 reduced word error rate by more than a
factor of three in white noise at 10 dB through -10 dB, and
by more than a factor of two in babble noise at the same
SNRs (Figure 1).
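
The scaling idea can be sketched as follows. This is a minimal illustration, not the estimator of [6]: it assumes that each sub-band autocorrelation is weighted by its pitch prediction coefficient (the ratio of the autocorrelation at the pitch lag to the autocorrelation at lag zero) before summation, and that negative coefficients are clipped to zero. The function names and both of those choices are assumptions made for the example.

```python
def autocorr(x, lag):
    """Unnormalized autocorrelation of sequence x at a non-negative lag."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))


def pitch_prediction_weight(band, pitch_period):
    """Estimated speech-to-total energy ratio in one band.
    Clipping negative values to zero is an assumption of this sketch."""
    r0 = autocorr(band, 0)
    if r0 <= 0.0:
        return 0.0
    return max(0.0, autocorr(band, pitch_period) / r0)


def combined_autocorr(bands, pitch_period, max_lag):
    """Weighted sum of sub-band autocorrelations for lags 0..max_lag."""
    weights = [pitch_prediction_weight(b, pitch_period) for b in bands]
    return [
        sum(w * autocorr(b, lag) for w, b in zip(weights, bands))
        for lag in range(max_lag + 1)
    ]


# Toy bands: one is periodic at the pitch lag (4 samples), the other is
# anti-correlated at that lag and therefore receives zero weight.
voiced = [1.0, 0.0, 0.0, 0.0] * 8
noisy = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0] * 4
w_voiced = pitch_prediction_weight(voiced, 4)
w_noisy = pitch_prediction_weight(noisy, 4)
r_hat = combined_autocorr([voiced, noisy], pitch_period=4, max_lag=4)
```

In this toy case the band that carries no energy at the pitch period contributes nothing to the combined estimate, which is the qualitative behavior that soft scaling is meant to provide in place of hard band selection.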

The phonological features implemented at a speech landmark
influence the acoustic spectrum at distances of 50-100 ms
[4, 19]. Complete representation of a 100 ms spectrogram
requires a 120-dimensional acoustic feature vector. It is
not possible to accurately train observation PDFs of
dimension 120 using existing data sets, but it is possible
to select a sub-vector using a quantitative optimality
criterion. In our research, we select a 39-dimensional
feature sub-vector from a list of 160 candidate features in
order to optimize the mutual information between features
and phoneme labels [12]. Optimality is determined using a
clean speech database (TIMIT) with no language model, but
the resulting optimality generalizes. As shown in Table 1,
the resulting MMIA (maximum mutual information acoustic)
feature vector outperforms all standard feature vectors
under at least three conditions: in quiet and at 10 dB SNR,
without a language model and with an optimized phoneme
bigram. Larger improvements may be obtained by testing the
5-10 best feature vectors generated during the mutual
information search. The best recognition accuracy, obtained
using the feature set with second-best mutual information,
was 62% with no language model in quiet conditions.

              No LM               Phone bigram
Features      35 dB     10 dB     35 dB     10 dB
LPCC          56        40        59        46
MFCC          58        42        63        48
FM            58        42        62        46
MMIA          59        43        63        49

Table 1: Phoneme recognition correctness (%) in four
conditions. Features selected using a maximum mutual
information criterion (MMIA) provide superior performance
in all four conditions.
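
As an illustration of selecting features by mutual information, the sketch below ranks discretized candidate features by their individual mutual information with the class labels and keeps the top k. This is a simplification and an assumption: the search procedure of [12] is not described here, and ranking features one at a time ignores redundancy between them. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete symbols."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def select_features(columns, labels, k):
    """Indices of the k feature columns with the highest MI with labels.
    Ties are broken in favor of the lower index."""
    scores = [(mutual_information(col, labels), i)
              for i, col in enumerate(columns)]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]


# Toy data: four frames, two phoneme classes, three candidate features.
labels = [0, 0, 1, 1]
columns = [
    [0, 0, 1, 1],  # perfectly informative
    [0, 1, 0, 1],  # independent of the label
    [5, 5, 5, 7],  # partially informative
]
chosen = select_features(columns, labels, k=2)
```

With real-valued acoustic features, the quantization step (not shown) and the handling of feature redundancy would both matter considerably more than in this toy case.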

2.3  Face  Tracking 

Research has shown that facial and vocal-tract motions are
highly correlated during speech production [20]. Speech
recognition using both audio and visual features has been
shown to be more robust in noisy environments [5]. Analysis
of non-rigid human facial motion is a key component for
acquiring visual features for audio/visual speech
recognition.

In the past several years, research in our group has led to
a robust 3D facial motion tracking system [16]. A 3D
non-rigid facial motion model is manually constructed based
on a piecewise Bezier volume deformation (PBVD) model. It
is used to constrain the noisy low-level optical flow. The
tracking is done in a multi-resolution manner so that
higher speed can be achieved. It runs at 5 fps on an SGI
Onyx2 machine. This tracking algorithm has been
successfully used for audio-visual speech recognition and
bimodal emotion recognition.
Figure 2: Demonstration of our face tracking system.


2.4  Gesture  Recognition 

Hand gestures are capable of delivering information not
presented in speech [14]. Controlling gestures can be used
to provide commands to the system. Navigation gestures
provide information for manipulating virtual objects, and
for selecting point objects or large regions on the screen.
Conversational gestures provide subtle cues to sentence
meaning in normal human interaction. Automated hand
tracking and gesture recognition can help improve the
performance of human-machine interfaces.

We have investigated both appearance-based gesture
recognition (using neural network-based pattern recognition
techniques) and model-based gesture recognition [18, 17].
In model-based recognition, the configuration of a hand
model is first determined by providing a set of joint angle
parameters. The 2D projection of this hand model,
determined by the translation and orientation of the model
relative to a viewing portal, is compared with the hand
image from input video. An estimate of the correct input
hand configuration is determined by the best matching
projection. A complete description of the global hand
position and all finger joint angles requires specification
of 21 joint angles. Using both known anatomical constraints
and PCA, we can initially reduce the dimensionality of the
gestural description from 21 to 7 independent dimensions
while keeping 95% of the information. In this 7-dimensional
space, it is possible to define 28 basis configurations,
consisting of the configurations in which each finger is
either fully folded or completely extended. A close
examination of the motion trajectories between these basis
states shows that natural hand articulations seem to lie
largely in the linear manifold spanned by pairs of basis
states. We believe that, based on these preliminary
results, it will be possible to map all observed gestures
into a low-dimensional gestural manifold, resulting in
efficient and accurate gesture recognition.
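
One way to picture the "linear manifold spanned by pairs of basis states" is as a set of line segments joining basis configurations in the reduced space. The sketch below, under that assumption, projects an observed configuration onto every such segment and reports the closest one. The function names are hypothetical, the toy example is two-dimensional rather than 7-dimensional, and the basis points are invented for illustration.

```python
def project_onto_segment(point, a, b):
    """Return (t, residual), where a + t * (b - a) with 0 <= t <= 1 is the
    point on segment a-b closest to `point`, and residual is the distance."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, point)]
    denom = sum(v * v for v in ab)
    t = 0.0 if denom == 0.0 else sum(u * v for u, v in zip(ap, ab)) / denom
    t = max(0.0, min(1.0, t))
    closest = [ai + t * v for ai, v in zip(a, ab)]
    residual = sum((pi - ci) ** 2 for pi, ci in zip(point, closest)) ** 0.5
    return t, residual


def nearest_basis_pair(point, basis):
    """Return (i, j, t, residual) for the pair of basis states whose
    connecting segment lies closest to `point`."""
    best = None
    for i in range(len(basis)):
        for j in range(i + 1, len(basis)):
            t, r = project_onto_segment(point, basis[i], basis[j])
            if best is None or r < best[3]:
                best = (i, j, t, r)
    return best


# Toy basis states (invented) and an observation halfway between two of them.
basis = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
i, j, t, residual = nearest_basis_pair([0.5, 0.0], basis)
```

A small residual indicates that the observed configuration is well explained as a transition between two basis states, and the parameter t gives its position along that transition.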


3 Dynamic  Bayesian  Networks 

3.1  Lip  Reading 

The focus of our research in lip reading is a novel
approach to the fusion problem in audio-visual speech
processing and recognition. Our fusion algorithm is built
upon the framework of coupled hidden Markov models
(CHMMs). CHMMs are probabilistic inference graphs that
have hidden Markov models (HMMs) as sub-graphs. Chains in
the corresponding inference graph are coupled through
matrices of conditional probabilities modeling temporal
dependencies between their hidden state variables. The
coupling probabilities are both cross-chain and cross-time.
The latter is essential for capturing temporal influences
between chains. In a bimodal speech recognition system,
two-chain CHMMs are deployed, with one chain being
associated with the acoustic observations, the other with
the visual features. Under this framework, the fusion of
the two modalities takes place during the classification
stage. The particular topology of the CHMM ensures that
learning and classification are based on the audio and
visual domains jointly, while allowing asynchronies between
the two information channels.

In essence, CHMMs are directed graphical models of
stochastic processes and are a special type of dynamic
Bayesian network (DBN). DBNs generalize HMMs by
representing the hidden states as state variables, and
allow the states to have complex interdependencies. The DBN
point of view facilitates the development of inference
algorithms for CHMMs. Specifically, two inference
algorithms are proposed in this work. Both algorithms are
exact methods. The first is an extension of the well-known
forward-backward algorithm from the HMM literature. The
second is a strategy of converting CHMMs to mathematically
equivalent HMMs, and carrying out learning in the
transformed models.
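
The conversion of a two-chain CHMM into an equivalent HMM can be sketched by forming composite states, one per pair of audio and visual chain states. The sketch below assumes that, given both previous states, the two chains transition independently, so the composite transition probability is the product of the two coupled conditionals. The array layout and the function name are assumptions made for this example and are not taken from the original work.

```python
def chmm_to_hmm(trans_a, trans_v):
    """Build the transition matrix of the equivalent product-state HMM.

    trans_a[a][v][a2] = P(audio state a2 at t | audio a, visual v at t-1)
    trans_v[a][v][v2] = P(visual state v2 at t | audio a, visual v at t-1)
    Composite state index: s = a * n_v + v.
    """
    n_a, n_v = len(trans_a), len(trans_a[0])
    size = n_a * n_v
    T = [[0.0] * size for _ in range(size)]
    for a in range(n_a):
        for v in range(n_v):
            for a2 in range(n_a):
                for v2 in range(n_v):
                    T[a * n_v + v][a2 * n_v + v2] = (
                        trans_a[a][v][a2] * trans_v[a][v][v2]
                    )
    return T


# Toy two-state chains with invented coupling probabilities.
trans_a = [[[0.9, 0.1], [0.6, 0.4]],
           [[0.0, 1.0], [0.0, 1.0]]]
trans_v = [[[0.8, 0.2], [0.0, 1.0]],
           [[0.5, 0.5], [0.0, 1.0]]]
T = chmm_to_hmm(trans_a, trans_v)
```

The composite state space grows as the product of the chain sizes, which is the price paid for being able to reuse standard HMM training and decoding on the transformed model.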

The benefits of the proposed fusion scheme are confirmed by
a series of preliminary experiments on audio-visual speech
recognition. Visual features based on lip geometry are used
in the experiments. Furthermore, compared with an
acoustic-only ASR system trained using only the audio
channel of the same dataset, the bimodal system
consistently demonstrates improved noise robustness across
a wide range of SNR levels.

SNR       10 dB     20 dB     30 dB
A          4.03     43.61     99.10
V         42.95     42.95     42.95
A+V       10.58     72.79     99.74
CHMM      35.32     86.58     93.32

Table 2: Results of experiments in audiovisual speech
recognition (measured in % word accuracy). A indicates the
audio-only system; V indicates the visual-only system; A+V
indicates a bimodal system using early integration; and
CHMM indicates the CHMM-based system.

3.2  Prosody 

Our approach to the recognition of prosody is the use of a
“hidden mode variable” [13] to condition the explicit
duration PDFs of a CVDHMM [7]. In our prototype algorithm,
the state space consists of parallel phonetic state
variables ($q_t$) and prosodic state variables ($k_t$). The
dwell time of state $q_t$ is a random variable $d_q$ with
PDF $p(d_q \mid q, k)$. At the end of the specified dwell
time, the phonetic variable always changes state (no
self-loops), but the prosodic state variable may or may not
change state. Thus, for example, if
$k_t \in \{\text{slow}, \text{medium}, \text{fast}\}$
represents speaking rate, it may be reasonable to allow
$k_t$ to change state at any word boundary with a small
probability.

In order to allow efficient experiments, we have modified
HTK to make use of Ferguson’s EM algorithm for
explicit-duration HMMs [1, 2]. Ferguson’s algorithm is an
order of magnitude faster than most algorithms for
explicit-duration HMMs. The computational complexity of the
algorithm is $O(NT(N + T))$, where $N$ is the number of
states, $T$ is the number of frames in the input signal,
and $O(N^2 T)$ is the complexity of an HMM without explicit
duration. The forward algorithm computes

$$\alpha^*_t(j) = P(O_1, \ldots, O_t, \; j \text{ commences at } t+1) = \sum_i \alpha_t(i) \, a_{ij}$$

$$\alpha_t(j) = P(O_1, \ldots, O_t, \; j \text{ ends at } t) = \sum_d \ldots$$
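
For orientation, the following is a minimal sketch of a forward pass for a generic explicit-duration HMM, written in the textbook form usually attributed to Ferguson. It is an assumption that this matches the modified HTK implementation; the sketch omits the hidden prosodic mode variable entirely, and the argument names and data layout are hypothetical.

```python
def explicit_duration_forward(obs, init, trans, dur, emit):
    """Return P(O_1..O_T, some state's segment ends at frame T).

    init[j]     : probability that state j begins the sequence
    trans[i][j] : P(next state j | state i just ended), no self-loops
    dur[j][d-1] : P(state j lasts exactly d frames), d = 1..len(dur[j])
    emit[j][o]  : P(observation symbol o | state j)
    """
    n, T = len(init), len(obs)
    # begin[t][j] = P(O_1..O_t, state j commences at t+1)
    # end[t][j]   = P(O_1..O_t, state j ends at t)
    begin = [[0.0] * n for _ in range(T + 1)]
    end = [[0.0] * n for _ in range(T + 1)]
    begin[0] = list(init)
    for t in range(1, T + 1):
        for j in range(n):
            total, b = 0.0, 1.0
            for d in range(1, min(len(dur[j]), t) + 1):
                # Extend the segment one frame further into the past.
                b *= emit[j][obs[t - d]]
                total += begin[t - d][j] * dur[j][d - 1] * b
            end[t][j] = total
        for j in range(n):
            begin[t][j] = sum(end[t][i] * trans[i][j] for i in range(n))
    return sum(end[T])


# Toy model: two states that strictly alternate; state 0 emits symbol 0
# and state 1 emits symbol 1, each with probability one.
init = [1.0, 0.0]
trans = [[0.0, 1.0], [1.0, 0.0]]
emit = [[1.0, 0.0], [0.0, 1.0]]
one_frame = [[1.0], [1.0]]            # every segment lasts exactly 1 frame
two_frames = [[0.0, 1.0], [0.0, 1.0]]  # every segment lasts exactly 2 frames
p_short = explicit_duration_forward([0, 1, 0, 1], init, trans, one_frame, emit)
p_long = explicit_duration_forward([0, 0, 1, 1], init, trans, two_frames, emit)
p_bad = explicit_duration_forward([0, 0, 1, 1], init, trans, one_frame, emit)
```

With a maximum duration D, the loops above cost on the order of NT(N + D) operations, which agrees with the complexity quoted in the text when durations up to T are allowed.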

3.3  User  State  Recognition 

Figure 3: Architecture for detecting events in the office
scenario.

Integration of a large number of sources for the purpose of
multimodal user-state recognition can be accomplished using
a hierarchical dynamic Bayesian network (Figure 3). In a
hierarchical DBN, each modality (audio, lip reading,
gesture, and prosody) is modeled using a
modality-dependent HMM. Each modality-dependent HMM is
searched in order to generate the N transcriptions that
best match the observed data in the given modality. The
likelihood of each transcription is then estimated using a
constrained forward-backward algorithm, generating the
probability of state residency during every frame. These
probabilities are fed forward to the supervisor HMM, which
integrates them to determine a single transcription of the
sentence in order to maximize the a posteriori
transcription probability. By imposing a prior on the
probability distributions learned by the model for the
purpose of increasing conditional entropy, we have
demonstrated a 10% increase in user state classification
performance [15, 11].
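
The final integration step can be illustrated with a much simpler stand-in for the supervisor HMM. The sketch below assumes that the modalities are conditionally independent given the transcription, so that the a posteriori score of a candidate is its log prior plus the sum of its per-modality log-likelihoods. The function name, the candidate labels, and all scores are hypothetical, and frame-level state residency probabilities are not modeled.

```python
import math


def map_transcription(candidates, prior, modality_logliks):
    """Return the candidate with the highest log posterior score.

    prior[c]              : prior probability of candidate c
    modality_logliks[m][c]: log-likelihood of candidate c under modality m;
                            a candidate missing from a modality's N-best
                            list is scored as impossible in that modality.
    """
    def score(c):
        return math.log(prior[c]) + sum(
            ll.get(c, float("-inf")) for ll in modality_logliks
        )
    return max(candidates, key=score)


# Toy example with invented scores from two modalities.
candidates = ["user is seated", "user is standing"]
prior = {"user is seated": 0.5, "user is standing": 0.5}
audio = {"user is seated": -2.0, "user is standing": -1.0}
gesture = {"user is seated": -0.5, "user is standing": -3.0}
best = map_transcription(candidates, prior, [audio, gesture])
```

Here the audio evidence alone favors the second candidate, but the combined score favors the first, which is the kind of disagreement a supervisor model is meant to resolve.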


4 Conclusions 


Our research is intended to elucidate both the theoretical
and the practical requirements for effective multimodal
speech understanding systems. The use of speech in
multimodal systems will increase our theoretical
understanding of the problems of sensor fusion and
representations of multimodal signals. Increased
theoretical understanding, in turn, will enable us to
produce practical results that can be directly used in
state-of-the-art speech recognition systems and as part of
larger systems for advanced human-machine communication.